This Exploratory Data Analysis looks at the quality of red wines and what influences it. I am keen to find out which chemical properties most affect the final quality rating of a red wine, on a scale of 0 to 10. This should eventually show which factors to look out for when choosing a quality red wine. The data was obtained through the Udacity Data Analysis Nanodegree website; it is also available from various internet sources, including Kaggle.
Number of observations (1599) and number of variables (13):
## [1] 1599 13
Column names: quality, plus measurements of the different chemical properties:
## [1] "X" "fixed.acidity" "volatile.acidity"
## [4] "citric.acid" "residual.sugar" "chlorides"
## [7] "free.sulfur.dioxide" "total.sulfur.dioxide" "density"
## [10] "pH" "sulphates" "alcohol"
## [13] "quality"
Summary (min, max, 1st and 3rd quartiles, etc.) for each variable:
## X fixed.acidity volatile.acidity citric.acid
## Min. : 1.0 Min. : 4.60 Min. :0.1200 Min. :0.000
## 1st Qu.: 400.5 1st Qu.: 7.10 1st Qu.:0.3900 1st Qu.:0.090
## Median : 800.0 Median : 7.90 Median :0.5200 Median :0.260
## Mean : 800.0 Mean : 8.32 Mean :0.5278 Mean :0.271
## 3rd Qu.:1199.5 3rd Qu.: 9.20 3rd Qu.:0.6400 3rd Qu.:0.420
## Max. :1599.0 Max. :15.90 Max. :1.5800 Max. :1.000
## residual.sugar chlorides free.sulfur.dioxide
## Min. : 0.900 Min. :0.01200 Min. : 1.00
## 1st Qu.: 1.900 1st Qu.:0.07000 1st Qu.: 7.00
## Median : 2.200 Median :0.07900 Median :14.00
## Mean : 2.539 Mean :0.08747 Mean :15.87
## 3rd Qu.: 2.600 3rd Qu.:0.09000 3rd Qu.:21.00
## Max. :15.500 Max. :0.61100 Max. :72.00
## total.sulfur.dioxide density pH sulphates
## Min. : 6.00 Min. :0.9901 Min. :2.740 Min. :0.3300
## 1st Qu.: 22.00 1st Qu.:0.9956 1st Qu.:3.210 1st Qu.:0.5500
## Median : 38.00 Median :0.9968 Median :3.310 Median :0.6200
## Mean : 46.47 Mean :0.9967 Mean :3.311 Mean :0.6581
## 3rd Qu.: 62.00 3rd Qu.:0.9978 3rd Qu.:3.400 3rd Qu.:0.7300
## Max. :289.00 Max. :1.0037 Max. :4.010 Max. :2.0000
## alcohol quality
## Min. : 8.40 Min. :3.000
## 1st Qu.: 9.50 1st Qu.:5.000
## Median :10.20 Median :6.000
## Mean :10.42 Mean :5.636
## 3rd Qu.:11.10 3rd Qu.:6.000
## Max. :14.90 Max. :8.000
A closer look at the format of each variable, to decide what is needed, what is not, and which variables need a type change. For example, the column ‘X’ will be removed, since it is merely an index and adds no value. The ‘quality’ rating, which is stored as an integer, will also be used to add a new column called ‘quality_rating’: ratings of 4 or below will be Bad, 5 to 6 Average and 7 to 10 Good. (If ratings 0 to 2 were present, they would have been marked as Worst; similarly, if 9 and 10 were present, they would have been marked as Excellent.)
## 'data.frame': 1599 obs. of 13 variables:
## $ X : int 1 2 3 4 5 6 7 8 9 10 ...
## $ fixed.acidity : num 7.4 7.8 7.8 11.2 7.4 7.4 7.9 7.3 7.8 7.5 ...
## $ volatile.acidity : num 0.7 0.88 0.76 0.28 0.7 0.66 0.6 0.65 0.58 0.5 ...
## $ citric.acid : num 0 0 0.04 0.56 0 0 0.06 0 0.02 0.36 ...
## $ residual.sugar : num 1.9 2.6 2.3 1.9 1.9 1.8 1.6 1.2 2 6.1 ...
## $ chlorides : num 0.076 0.098 0.092 0.075 0.076 0.075 0.069 0.065 0.073 0.071 ...
## $ free.sulfur.dioxide : num 11 25 15 17 11 13 15 15 9 17 ...
## $ total.sulfur.dioxide: num 34 67 54 60 34 40 59 21 18 102 ...
## $ density : num 0.998 0.997 0.997 0.998 0.998 ...
## $ pH : num 3.51 3.2 3.26 3.16 3.51 3.51 3.3 3.39 3.36 3.35 ...
## $ sulphates : num 0.56 0.68 0.65 0.58 0.56 0.56 0.46 0.47 0.57 0.8 ...
## $ alcohol : num 9.4 9.8 9.8 9.8 9.4 9.4 9.4 10 9.5 10.5 ...
## $ quality : int 5 5 5 6 5 5 5 7 7 5 ...
Distribution of red wine ratings. Actual count of each rating:
##
## 3 4 5 6 7 8
## 10 53 681 638 199 18
Below is a bar graph of the ratings classified as Bad, Average and Good. The majority of wines have an Average rating. This may limit how meaningful the exploratory analysis can be for wines rated Bad or Good. Count of each rating class:
##
## Bad Average Good
## 63 1319 217
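The grouping can be reproduced from the rating counts printed earlier. The following is a minimal sketch rather than the original code: it rebuilds the quality column from those counts and bins it with `cut()`. The object names (`quality_counts`, `rating_table`, `rating_share`) are illustrative assumptions, and the lower break of `-Inf` simply places any rating of 4 or below in Bad.

```r
# Rating counts as printed in the table above (ratings 3 to 8)
quality_counts <- c("3" = 10, "4" = 53, "5" = 681, "6" = 638, "7" = 199, "8" = 18)

# Rebuild one quality value per wine from the counts
quality <- rep(as.integer(names(quality_counts)), times = quality_counts)

# Bin into three ordered classes: <= 4 Bad, 5-6 Average, 7-10 Good
quality_rating <- cut(quality,
                      breaks = c(-Inf, 4, 6, 10),
                      labels = c("Bad", "Average", "Good"),
                      ordered_result = TRUE)

rating_table <- table(quality_rating)
print(rating_table)

# Share of each class, in percent
rating_share <- round(100 * prop.table(rating_table), 2)
print(rating_share)
```

Using an ordered factor keeps Bad, Average and Good in a sensible order on plot axes instead of sorting them alphabetically.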
Fixed acidity is one of the key chemical properties that determine the quality of red wine. Its distribution peaks at 7.2 (the most frequent value, with 67 wines) and is skewed to the right.
Some outliers are removed from this plot and from the rest of the plots.
##
## 4.6 4.7 4.9 5 5.1 5.2 5.3 5.4 5.5 5.6 5.7 5.8 5.9 6 6.1
## 1 1 1 6 4 6 4 5 1 14 2 4 9 13 16
## 6.2 6.3 6.4 6.5 6.6 6.7 6.8 6.9 7 7.1 7.2 7.3 7.4 7.5 7.6
## 20 14 25 17 37 28 46 38 50 57 67 44 44 52 46
## 7.7 7.8 7.9 8 8.1 8.2 8.3 8.4 8.5 8.6 8.7 8.8 8.9 9 9.1
## 49 53 42 42 26 45 40 26 19 27 24 34 33 26 29
## 9.2 9.3 9.4 9.5 9.6 9.7 9.8 9.9 10 10.1 10.2 10.3 10.4 10.5 10.6
## 16 22 17 14 17 9 15 26 23 10 19 11 21 12 14
## 10.7 10.8 10.9 11 11.1 11.2 11.3 11.4 11.5 11.6 11.7 11.8 11.9 12 12.1
## 10 10 8 3 9 5 7 5 13 12 3 3 12 7 1
## 12.2 12.3 12.4 12.5 12.6 12.7 12.8 12.9 13 13.2 13.3 13.4 13.5 13.7 13.8
## 4 5 4 7 4 4 5 2 3 3 3 1 1 2 1
## 14 14.3 15 15.5 15.6 15.9
## 1 1 2 2 2 1
Volatile acidity is another important acidity measure for the quality of red wine. Its distribution is also skewed to the right but is reasonably bell-shaped. Citric acid is the third acidity measure. Its distribution shows three peaks, near 0, 0.22 and 0.45 (the frequency table below has its largest counts at 0, 0.24 and 0.49).
##
## 0 0.01 0.02 0.03 0.04 0.05 0.06 0.07 0.08 0.09 0.1 0.11 0.12 0.13 0.14
## 132 33 50 30 29 20 24 22 33 30 35 15 27 18 21
## 0.15 0.16 0.17 0.18 0.19 0.2 0.21 0.22 0.23 0.24 0.25 0.26 0.27 0.28 0.29
## 19 9 16 22 21 25 33 27 25 51 27 38 20 19 21
## 0.3 0.31 0.32 0.33 0.34 0.35 0.36 0.37 0.38 0.39 0.4 0.41 0.42 0.43 0.44
## 30 30 32 25 24 13 20 19 14 28 29 16 29 15 23
## 0.45 0.46 0.47 0.48 0.49 0.5 0.51 0.52 0.53 0.54 0.55 0.56 0.57 0.58 0.59
## 22 19 18 23 68 20 13 17 14 13 12 8 9 9 8
## 0.6 0.61 0.62 0.63 0.64 0.65 0.66 0.67 0.68 0.69 0.7 0.71 0.72 0.73 0.74
## 9 2 1 10 9 7 14 2 11 4 2 1 1 3 4
## 0.75 0.76 0.78 0.79 1
## 1 3 1 1 1
Residual sugar has a right-skewed distribution with a very long tail. The distribution peaks at around 2.1.
##
## 0.9 1.2 1.3 1.4 1.5 1.6 1.65 1.7 1.75 1.8 1.9 2 2.05 2.1 2.15
## 2 8 5 35 30 58 2 76 2 129 117 156 2 128 2
## 2.2 2.25 2.3 2.35 2.4 2.5 2.55 2.6 2.65 2.7 2.8 2.85 2.9 2.95 3
## 131 1 109 1 86 84 1 79 1 39 49 1 24 1 25
## 3.1 3.2 3.3 3.4 3.45 3.5 3.6 3.65 3.7 3.75 3.8 3.9 4 4.1 4.2
## 7 15 11 15 1 2 8 1 4 1 8 6 11 6 5
## 4.25 4.3 4.4 4.5 4.6 4.65 4.7 4.8 5 5.1 5.15 5.2 5.4 5.5 5.6
## 1 8 4 4 6 2 1 3 1 5 1 3 1 8 6
## 5.7 5.8 5.9 6 6.1 6.2 6.3 6.4 6.55 6.6 6.7 7 7.2 7.3 7.5
## 1 4 3 4 4 3 2 3 2 2 2 1 1 1 1
## 7.8 7.9 8.1 8.3 8.6 8.8 8.9 9 10.7 11 12.9 13.4 13.8 13.9 15.4
## 2 3 2 3 1 2 1 1 1 2 1 1 2 1 2
## 15.5
## 1
Chlorides is also skewed to the right, with a long right tail. The distribution peaks at around 0.074.
##
## 0.012 0.034 0.038 0.039 0.041 0.042 0.043 0.044 0.045 0.046 0.047 0.048
## 2 1 2 4 4 3 1 5 4 4 4 8
## 0.049 0.05 0.051 0.052 0.053 0.054 0.055 0.056 0.057 0.058 0.059 0.06
## 8 12 1 10 5 13 8 9 10 14 17 16
## 0.061 0.062 0.063 0.064 0.065 0.066 0.067 0.068 0.069 0.07 0.071 0.072
## 11 24 22 20 23 32 27 30 21 35 47 24
## 0.073 0.074 0.075 0.076 0.077 0.078 0.079 0.08 0.081 0.082 0.083 0.084
## 35 55 45 51 47 51 43 66 40 46 35 49
## 0.085 0.086 0.087 0.088 0.089 0.09 0.091 0.092 0.093 0.094 0.095 0.096
## 25 31 25 32 25 21 19 22 21 19 23 18
## 0.097 0.098 0.099 0.1 0.101 0.102 0.103 0.104 0.105 0.106 0.107 0.108
## 18 12 8 13 5 10 7 16 6 8 9 1
## 0.109 0.11 0.111 0.112 0.113 0.114 0.115 0.116 0.117 0.118 0.119 0.12
## 3 8 7 6 1 11 5 2 4 8 3 3
## 0.121 0.122 0.123 0.124 0.125 0.126 0.127 0.128 0.132 0.136 0.137 0.143
## 2 7 6 3 1 1 1 1 4 1 1 1
## 0.145 0.146 0.147 0.148 0.152 0.153 0.157 0.159 0.161 0.165 0.166 0.168
## 1 1 1 1 2 1 3 1 1 1 3 1
## 0.169 0.17 0.171 0.172 0.174 0.176 0.178 0.186 0.19 0.194 0.2 0.205
## 1 1 2 1 1 1 2 1 1 1 1 2
## 0.213 0.214 0.216 0.222 0.226 0.23 0.235 0.236 0.241 0.243 0.25 0.263
## 1 3 1 1 2 1 1 1 1 1 1 1
## 0.267 0.27 0.332 0.337 0.341 0.343 0.358 0.36 0.368 0.369 0.387 0.401
## 1 1 1 1 1 1 1 1 1 1 1 1
## 0.403 0.413 0.414 0.415 0.422 0.464 0.467 0.61 0.611
## 1 1 2 3 1 1 1 1 1
Free sulfur dioxide peaks at around 5 to 6 (6 is the most frequent value in the table below) and is skewed to the right.
##
## 1 2 3 4 5 5.5 6 7 8 9 10 11 12 13 14
## 3 1 49 41 104 1 138 71 56 62 79 59 75 57 50
## 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29
## 78 61 60 46 39 30 41 22 32 34 24 32 29 23 23
## 30 31 32 33 34 35 36 37 37.5 38 39 40 40.5 41 42
## 16 20 22 11 18 15 11 3 2 9 5 6 1 7 3
## 43 45 46 47 48 50 51 52 53 54 55 57 66 68 72
## 3 3 1 1 4 2 4 3 1 1 2 1 1 2 1
Total sulfur dioxide is also skewed to the right. Density is probably the only variable that is close to normally distributed.
Most pH values lie between 2.9 and 3.7, which shows that the red wines are all very acidic.
Sulphates is skewed to the right, with a peak just under 0.6.
Alcohol is heavily concentrated at around 9.5.
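A quick numeric cross-check of the skew descriptions is to compare each variable's mean with its median, since a mean well above the median hints at a right tail. The sketch below uses only the means and medians from the summary output earlier; the `rel_gap` column is an illustrative assumption, and this is a rough heuristic rather than a formal skewness measure.

```r
# Mean and median of each chemical variable, copied from the summary output earlier
stats <- data.frame(
  variable = c("fixed.acidity", "volatile.acidity", "citric.acid",
               "residual.sugar", "chlorides", "free.sulfur.dioxide",
               "total.sulfur.dioxide", "density", "pH", "sulphates", "alcohol"),
  mean   = c(8.32, 0.5278, 0.271, 2.539, 0.08747, 15.87,
             46.47, 0.9967, 3.311, 0.6581, 10.42),
  median = c(7.90, 0.5200, 0.260, 2.200, 0.07900, 14.00,
             38.00, 0.9968, 3.310, 0.6200, 10.20)
)

# Relative gap between mean and median; a clearly positive gap hints at a right tail
stats$rel_gap <- (stats$mean - stats$median) / stats$median

# Largest gaps first
stats <- stats[order(stats$rel_gap, decreasing = TRUE), ]
print(stats, row.names = FALSE)
```

On these figures, total sulfur dioxide and residual sugar show the largest gaps, while density and pH have means almost equal to their medians.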
There are 1599 wines in the dataset with 12 features (13 variables, but the first variable, X, is just a row index). The features are fixed.acidity, volatile.acidity, citric.acid, residual.sugar, chlorides, free.sulfur.dioxide, total.sulfur.dioxide, density, pH, sulphates, alcohol and quality.
All variables are numeric, stored as either double or integer, except ‘quality’, an integer that was converted to a factor because it is a classification rather than a measurement.
Observations:
1) Quality (10 being best, 0 being worst): this dataset only contains ratings from 3 to 8. The most common rating is 5, with 681 of 1599 wines, followed by 6 with 638 and 7 with 199. The least common is 3, with 10. The first quartile is 5 and both the median and the third quartile are 6; from the counts, about 96% of wines (1536 of 1599) are rated 5 or higher and about 53% (855 of 1599) are rated 6 or higher.
2) Fixed acidity: the most frequent value is 7.2.
3) Volatile acidity: interquartile range of 0.39 to 0.64.
4) pH: the ideal pH range is around 3.2 to 3.7 (reference). The interquartile range is 3.21 to 3.4, which is within the ideal range. The top 25% lies between 3.4 and 4.01.
I would like to find out the relationship between the level of acidity and quality. I would also like to find out how the combination of acidity and other factors affects the overall quality of the red wine.
Other features that may support the investigation are residual sugar, chlorides, free sulfur dioxide, total sulfur dioxide, sulphates and alcohol, but mainly pH.
A new variable, ‘quality_rating’, is created to group the quality ratings into three:
1) 0 to 4: Bad
2) 5 to 6: Average
3) 7 to 10: Good
In many of the plots, outliers are removed so that the graphs focus on the range where most values are concentrated. Once outliers are removed, the distributions look less skewed.
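The text does not say how the outliers were identified, so the following is only one possible approach, sketched under that assumption: keep the values between the 1st and 99th percentiles. The helper `trim_outliers()`, its cut-offs and the `toy_sugar` vector are all hypothetical and are not taken from the original analysis.

```r
# Hypothetical helper: keep values between two quantiles (default 1% and 99%)
trim_outliers <- function(x, lower = 0.01, upper = 0.99) {
  bounds <- quantile(x, probs = c(lower, upper), na.rm = TRUE)
  x[!is.na(x) & x >= bounds[1] & x <= bounds[2]]
}

# Toy right-skewed vector (illustrative values, not the real column)
toy_sugar <- c(0.9, 1.2, rep(2.0, 50), rep(2.2, 45), 13.9, 15.4, 15.5)

trimmed <- trim_outliers(toy_sugar)

# Compare the range before and after trimming
print(range(toy_sugar))
print(range(trimmed))
```

For plots only, limiting the axis range instead of dropping rows achieves a similar visual effect while leaving the data untouched.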
Some pairs show stronger correlations than others: 1) Quality and alcohol: 0.5, moderate 2) Quality and volatile acidity: -0.4, moderate. Other combinations also show strong correlations (positive or negative), but where they do not involve quality I am not focusing on them in this EDA.
##
## Pearson's product-moment correlation
##
## data: redwine_pf$quality and redwine_pf$alcohol
## t = 21.639, df = 1597, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## 0.4373540 0.5132081
## sample estimates:
## cor
## 0.4761663
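The numbers in this output are linked: for a Pearson test, the correlation can be recovered from the t statistic as r = t / sqrt(t^2 + df), and the confidence interval comes from the Fisher z-transform. The sketch below re-derives both from the reported values only; the variable names are illustrative, and the results should match the output above up to rounding.

```r
# Values reported in the test output above for quality vs alcohol
t_stat <- 21.639
df     <- 1597
r_rep  <- 0.4761663
n      <- df + 2   # the Pearson test uses df = n - 2

# Correlation implied by the t statistic
r_from_t <- t_stat / sqrt(t_stat^2 + df)

# 95% confidence interval via the Fisher z-transform
z  <- atanh(r_rep)
se <- 1 / sqrt(n - 3)
ci <- tanh(z + c(-1, 1) * qnorm(0.975) * se)

print(round(r_from_t, 4))
print(round(ci, 4))
```

The very small p-value reflects the large sample (1599 wines) as much as the strength of the relationship; the correlation itself is still only moderate.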
As the correlation above shows (0.4762), alcohol level tends to go up as quality goes up. There is a slight drop from quality 4 to 5; however, the dataset does not have many samples of red wines with quality 3 and 4 (7 and 8 are also scarce, but less so than 3 and 4), so this may be a distortion.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 8.40 9.60 10.00 10.22 11.00 13.10
The above is the alcohol summary for ‘Bad’ quality wines. Both the mean and the median are lower than for ‘Good’ quality wines. The max value is lower than the ‘Average’ max.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 8.40 9.50 10.00 10.25 10.90 14.90
For ‘Average’ quality wines, the median is the same as for ‘Bad’ quality wines, but the mean is slightly higher (possibly due to the higher max value).
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 9.20 10.80 11.60 11.52 12.20 14.00
For ‘Good’ quality wines, both the median and the mean are higher than for ‘Bad’ and ‘Average’ wines. Interestingly, the max value is lower than for ‘Average’ wines. Graphical representation of the above:
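Per-group summaries like these can be produced by splitting a column by the rating class. As a small sketch, the example below uses only the first ten rows of alcohol and quality shown in the structure output earlier; the data frame name `wine_head` is an assumption, and with so few rows the results will not match the full-dataset summaries.

```r
# First ten rows of alcohol and quality, as shown in the structure output earlier
wine_head <- data.frame(
  alcohol = c(9.4, 9.8, 9.8, 9.8, 9.4, 9.4, 9.4, 10, 9.5, 10.5),
  quality = c(5L, 5L, 5L, 6L, 5L, 5L, 5L, 7L, 7L, 5L)
)

# Same three classes as in the text: <= 4 Bad, 5-6 Average, 7-10 Good
wine_head$quality_rating <- cut(wine_head$quality,
                                breaks = c(-Inf, 4, 6, 10),
                                labels = c("Bad", "Average", "Good"))

# Mean alcohol per rating class; classes with no rows come back as NA
mean_alcohol <- tapply(wine_head$alcohol, wine_head$quality_rating, mean)
print(mean_alcohol)
```

Swapping `mean` for `summary` in the `tapply()` call would give the full six-number summary per class.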
Another factor that strongly affects quality is volatile acidity.
##
## Pearson's product-moment correlation
##
## data: redwine_pf$quality and redwine_pf$volatile.acidity
## t = -16.954, df = 1597, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## -0.4313210 -0.3482032
## sample estimates:
## cor
## -0.3905578
This shows a gentle downward trend, consistent with the negative correlation between volatile acidity and quality.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.2300 0.5650 0.6800 0.7242 0.8825 1.5800
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.1600 0.4100 0.5400 0.5386 0.6400 1.3300
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.1200 0.3000 0.3700 0.4055 0.4900 0.9150
Although not as strong, sulphates also shows a reasonably stronger correlation with quality than the other chemicals, at 0.251 (positive). As expected, the sulphates level goes up as quality increases. Once again, I would expect a steeper curve if there were as much data for quality 3 & 4 (‘Bad’) and 7 & 8 (‘Good’).
##
## Pearson's product-moment correlation
##
## data: redwine_pf$quality and redwine_pf$sulphates
## t = 10.38, df = 1597, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## 0.2049011 0.2967610
## sample estimates:
## cor
## 0.2513971
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.3300 0.4950 0.5600 0.5922 0.6000 2.0000
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.3700 0.5400 0.6100 0.6473 0.7000 1.9800
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.3900 0.6500 0.7400 0.7435 0.8200 1.3600
The output below shows the correlation of the three acidity measures with pH. As they all represent ‘acidity’, I will later combine them and compare. However, one of them, ‘volatile acidity’, gives an interesting result: unlike the other two, it shows a positive correlation with pH. This may need further investigation.
##
## Pearson's product-moment correlation
##
## data: redwine_pf$pH and redwine_pf$volatile.acidity
## t = 9.659, df = 1597, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## 0.1880823 0.2807254
## sample estimates:
## cor
## 0.2349373
##
## Pearson's product-moment correlation
##
## data: redwine_pf$pH and redwine_pf$fixed.acidity
## t = -37.366, df = 1597, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## -0.7082857 -0.6559174
## sample estimates:
## cor
## -0.6829782
##
## Pearson's product-moment correlation
##
## data: redwine_pf$pH and redwine_pf$citric.acid
## t = -25.767, df = 1597, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## -0.5756337 -0.5063336
## sample estimates:
## cor
## -0.5419041
Hence, I checked for Simpson’s paradox between volatile acidity and pH. Simpson’s paradox occurs when a relationship that appears within each of several groups of data disappears or reverses when the groups are combined.
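To make the idea concrete, here is a small sketch on toy data (not taken from the wine dataset, and not the original check). Each group has a perfectly falling trend, yet the pooled data show a rising one. The names `toy`, `within_r` and `pooled_r` are illustrative.

```r
# Toy data (not from the wine dataset): two groups, each with a falling trend
toy <- data.frame(
  group = rep(c("A", "B"), each = 5),
  x     = 1:10,
  y     = c(5, 4, 3, 2, 1, 10, 9, 8, 7, 6)
)

# Correlation within each group
within_r <- sapply(split(toy, toy$group), function(d) cor(d$x, d$y))

# Correlation when the groups are pooled
pooled_r <- cor(toy$x, toy$y)

print(within_r)
print(pooled_r)
```

For the wine data, the analogous check would be to compare the overall pH and volatile acidity correlation with the correlations inside subgroups, for example within each quality rating.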
The variables that stand out in their correlation with quality (according to the correlation table) are alcohol and volatile acidity. Alcohol shows a positive correlation of about 0.5, meaning that wines with more alcohol tend to be rated higher. However, bear in mind that this is a moderate positive correlation, not a strong or very strong one. In the boxplots for each quality group (0-4, 5-6 and 7-10), the highest group clearly shows a much higher alcohol level than the other two. Volatile acidity shows a negative correlation of about 0.4, meaning that wines with less volatile acidity tend to be rated higher. Its boxplots clearly show a downward trend in volatile acidity as quality goes up. If anything else, I would look at sulphates, which shows a positive correlation of about 0.25 with quality.
Alcohol and density show a negative correlation of circa 0.5. pH and fixed acidity show a negative correlation of circa 0.7. Density and fixed acidity show a positive correlation of circa 0.7. Citric acid and volatile acidity show a negative correlation of circa 0.6. Citric acid and fixed acidity show a positive correlation of circa 0.7. Volatile acidity and fixed acidity show a negative correlation of circa 0.3.
Alcohol and volatile acidity show the strongest relationships with quality.
## [1] "fixed.acidity" "volatile.acidity" "citric.acid"
## [4] "residual.sugar" "chlorides" "free.sulfur.dioxide"
## [7] "total.sulfur.dioxide" "density" "pH"
## [10] "sulphates" "alcohol" "quality"
## [13] "total.acidity"
The most related variables are acidity, alcohol and quality. I added all the acidities (volatile, fixed and citric) together and compared the total with increasing alcohol level, which is moderately correlated with overall quality. With the exception of outliers in the average quality group, total acidity trends downwards as alcohol and quality increase.
Total acidity seems to start at a higher level for each quality rating category.
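The combined column can be sketched as below, assuming ‘total.acidity’ is the plain sum of the three acidity columns, as "added" suggests. The example uses the first four rows from the structure output earlier; the data frame name `acid_head` is an assumption.

```r
# First four rows of the three acidity columns, from the structure output earlier
acid_head <- data.frame(
  fixed.acidity    = c(7.4, 7.8, 7.8, 11.2),
  volatile.acidity = c(0.70, 0.88, 0.76, 0.28),
  citric.acid      = c(0.00, 0.00, 0.04, 0.56)
)

# Total acidity as the plain sum of the three measures (assumed definition)
acid_head$total.acidity <- with(acid_head,
                                fixed.acidity + volatile.acidity + citric.acid)
print(acid_head)
```

Because fixed acidity is roughly ten times larger than the other two measures, a plain sum is dominated by fixed acidity, which is worth keeping in mind when reading the total acidity trend.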
N/A
This EDA (Exploratory Data Analysis) aims to understand which chemical properties affect the overall quality of red wines. The original dataset contains 13 variables and 1599 observations. The variables include measurements such as alcohol, pH and sulphates, as well as the overall quality of the red wine on a scale of 0 to 10 (0 being worst and 10 being best).
Because the dataset is heavily concentrated in the average quality ratings (5 and 6, which make up 82.49% of the dataset, compared with 3.9% for bad quality and 13.57% for good quality), the EDA cannot provide a very accurate or general investigation of the different quality categories. This should be taken into consideration when judging how meaningful the analysis is for wines other than those of average quality (scores 5 and 6).
Alcohol is one of the variables that determine the overall quality of the wines. There is a general positive correlation between the amount of alcohol and the overall quality of red wine, and in this dataset the alcohol percentage mostly varies between 9 and 13.
Another factor that affects the quality of red wine, in addition to alcohol, is the acidity level. The dataset contains citric, fixed and volatile acidity levels. I combined these three into a new variable called ‘Total acidity’. Allowing for some outliers among wines with an ‘average’ quality rating, the general trend is that the lower the acidity, the better the wine quality.
As this was my very first project in R, it has been a great pleasure to learn how easy yet powerful a tool it is for analysing data. The plotting tools were powerful and fast, and it was easy to add elements that helped me understand the underlying message behind the numbers in the dataset. Despite my limited ability to add more factors to the graphs and plots, I managed to identify the variables that affect the overall quality of wines, namely alcohol and acidity. What would have been better is a fuller dataset with an equal number of observations for each quality rating. It would have helped me explore the data more accurately and understand how each variable increases or reduces the quality rating. Lastly, understanding and enjoying wine more would have helped me learn the links between the variables. On that note, I am happy to start learning more about red wine; understanding the industry will help me explore this dataset with more insight from the real world.